A Structural Segmentation of Songs Using Generalized Likelihood Ratio under Regularity Assumptions

Authors

  • Gabriel SARGENT
  • Frédéric BIMBOT
  • Emmanuel VINCENT
Abstract

This document presents the algorithm submitted to the "Structural segmentation" task at MIREX 2010. It consists of three parts. First, features (beats, MFCCs, chroma) are extracted from the song using existing scripts. Second, the song is segmented according to three criteria that localize statistical breakpoints, repeated feature sequences, and short events, using a filtered version of the generalized likelihood ratio. The segment borders are then selected according to the amplitude of these criteria and a regularity constraint on the length of the structural segments sought. Third, the segments are grouped into classes of similar segments using hierarchical (agglomerative) clustering. The number of steps of this clustering is estimated separately for each song.

1. FEATURE EXTRACTION

In the framework of MIREX, the songs' sample rate is 44100 Hz. Three features are extracted for each input song:

  • beats (estimated by Daniel Ellis's scripts¹)
  • MFCCs: 20 coefficients (including the 0th coefficient), with window size = 23.2 ms and window hop size = 11.6 ms (using scripts from the MA toolbox by Beth Logan and Malcolm Slaney²)
  • chroma vectors: 12 coefficients, with window size = 92.9 ms and window hop size = 23.2 ms (using Daniel Ellis's scripts³)

¹ http://labrosa.ee.columbia.edu/projects/coversongs/
² http://www.ofai.at/~elias.pampalk/ma/documentation.html
³ http://labrosa.ee.columbia.edu/projects/coversongs/

This document is licensed under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 License. http://creativecommons.org/licenses/by-nc-sa/3.0/ © 2010 The Authors.

2. STRUCTURAL SEGMENTATION OF THE AUDIO SIGNAL

The segmentation is based on three criteria computed from the audio signal at the beat rate:
  • criterion 1 evaluates the presence of statistical breakpoints in the MFCC sequence extracted from the audio file
  • criterion 2 evaluates the presence of repeated sequences of features, using the chroma vector sequence extracted from the audio file
  • criterion 3 evaluates the presence of short events, using the same MFCC description as criterion 1

Each criterion is computed as a Generalized Likelihood Ratio (GLR), which compares, at each beat, the likelihood of the extracted feature sequence Y under a particular assumption (denoted H0) with its likelihood under the opposite assumption (H1):

$$\mathrm{GLR} = \frac{P(Y \mid H_1)}{P(Y \mid H_0)} \tag{1}$$

For criterion 1, the assumption H0 is that the MFCCs contained in a 12 s window centered on the current beat are well modeled by a single Gaussian distribution. H1, on the contrary, assumes that this group of MFCCs is well modeled by two Gaussian distributions: one models the MFCCs located before the current beat, and the other models the MFCCs located after it. At each beat we therefore evaluate whether H1 is a better assumption than H0. High values of criterion 1 correspond to H1 being preferred and indicate candidate locations for structural segment borders (the higher the value, the more probable the border).

Criterion 2 evaluates, for each beat, whether the sequence of chroma vectors contained in a 12 s window centered on that beat is repeated as a whole in the rest of the song (H0), or whether the two halves of this sequence are repeated separately (H1). Two sequences are compared using the Euclidean distance.

Criterion 3 considers, for every beat, a long window (12 s) and a short window (2 s), both centered on that beat.
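The Gaussian test behind criterion 1 can be sketched as follows. This is a minimal illustration and not the authors' code: it assumes full-covariance Gaussians fitted by maximum likelihood, works in the log domain, and adds a small ridge `eps` to keep the covariances invertible. The function names and the `frame_times` / `beat_times` inputs (in seconds) are hypothetical. With maximum-likelihood fits, the logarithm of Eq. (1) reduces to (N/2) log|Σ| − (N1/2) log|Σ1| − (N2/2) log|Σ2|, where Σ, Σ1 and Σ2 are the covariances of the whole window, its left half and its right half.

```python
import numpy as np

def log_glr_two_gaussians(left, right, eps=1e-6):
    """Log of Eq. (1): two Gaussians (H1) versus one Gaussian (H0).

    `left` and `right` are (n_frames, n_coeffs) arrays of feature vectors.
    """
    def half_n_logdet(x):
        # ML covariance (bias=True) plus a small ridge (an assumption of this sketch).
        cov = np.cov(x, rowvar=False, bias=True) + eps * np.eye(x.shape[1])
        return 0.5 * len(x) * np.linalg.slogdet(cov)[1]

    both = np.vstack([left, right])
    return half_n_logdet(both) - half_n_logdet(left) - half_n_logdet(right)

def criterion1(mfcc, frame_times, beat_times, window=12.0):
    """Unfiltered log-GLR at each beat, using a window centered on the beat."""
    half = window / 2.0
    min_frames = mfcc.shape[1] + 1  # enough frames to estimate a covariance
    values = np.zeros(len(beat_times))
    for i, t in enumerate(beat_times):
        left = mfcc[(frame_times >= t - half) & (frame_times < t)]
        right = mfcc[(frame_times >= t) & (frame_times < t + half)]
        if len(left) >= min_frames and len(right) >= min_frames:
            values[i] = log_glr_two_gaussians(left, right)
    return values
```

Beats too close to the start or end of the song to fill both half-windows are left at zero here; how the original system treats such edges is not stated in the text.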
For this third criterion, H0 models the MFCCs contained in the whole long window with a single Gaussian distribution, whereas H1 models them with two Gaussian distributions: one for the MFCCs of the long window excluding those of the short window, and another for the MFCCs of the short window only. Criterion 3 is evaluated in the same way as criterion 1, and its high peaks indicate candidate locations of short events.

The lengths of the different analysis windows have been tuned on a development set of 10 popular songs. These values will need to be adjusted in the future on a wider corpus.

The three criteria are filtered using the method proposed by Seck in the context of the segmentation of an audio flow into speech and music segments [3]. The peaks of the resulting criteria are used as hypotheses on the location of the borders of the structural segments.

The structural pulsation period [2] is estimated by Fourier transform of the filtered version of criterion 1. The borders are then selected by dynamic programming, minimizing the amplitude of the 3 criteria between segment borders, combined with a penalty function that increases as the segments' length moves away from the estimated structural pulsation period.

3. LABELLING THE STRUCTURAL SEGMENTS

For each segment obtained in the previous process, the MFCC sequence is modeled by a Gaussian distribution. We use the Gaussian parameters of the segments to compute a symmetrized Gaussian likelihood measure [1] in order to compare their timbral content.

The segments are grouped using a hierarchical clustering algorithm [4]. First, each segment is assigned to a different class. Then, at each iteration, the two classes that contain the most similar Gaussian models are merged (the new class is modeled by the fusion of the 2 Gaussian models).
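The agglomerative grouping described above can be sketched as follows. The exact symmetrized Gaussian likelihood measure of [1] is not given in the text, so this sketch substitutes the symmetrized Kullback-Leibler divergence between two Gaussians as the dissimilarity; that substitution, the moment-matching rule used to fuse two Gaussian models, and all function names are assumptions made for illustration only.

```python
import numpy as np

def fit(frames):
    """Gaussian model of a segment: (frame count, mean, ML covariance)."""
    return len(frames), frames.mean(axis=0), np.cov(frames, rowvar=False, bias=True)

def merge(g1, g2):
    """Fuse two Gaussian models by matching the moments of the pooled frames."""
    (n1, m1, c1), (n2, m2, c2) = g1, g2
    n = n1 + n2
    m = (n1 * m1 + n2 * m2) / n
    c = (n1 * (c1 + np.outer(m1 - m, m1 - m))
         + n2 * (c2 + np.outer(m2 - m, m2 - m))) / n
    return n, m, c

def sym_kl(g1, g2):
    """Symmetrized KL divergence, KL(1||2) + KL(2||1) (stand-in for [1])."""
    (_, m1, c1), (_, m2, c2) = g1, g2
    i1, i2 = np.linalg.inv(c1), np.linalg.inv(c2)
    d = m1 - m2
    return 0.5 * (np.trace(i2 @ c1) + np.trace(i1 @ c2) + d @ (i1 + i2) @ d) - len(m1)

def agglomerate(segments, n_steps):
    """Run `n_steps` merges; return per-segment labels and the cost of each merge."""
    models = [fit(s) for s in segments]
    labels = list(range(len(segments)))  # one class per segment initially
    merge_costs = []
    for _ in range(n_steps):
        active = sorted(set(labels))
        if len(active) < 2:
            break
        # Find the pair of classes with the most similar Gaussian models.
        cost, a, b = min((sym_kl(models[a], models[b]), a, b)
                         for i, a in enumerate(active) for b in active[i + 1:])
        models[a] = merge(models[a], models[b])
        labels = [a if lab == b else lab for lab in labels]
        merge_costs.append(cost)
    return labels, merge_costs
```

Here `segments` is a list of (n_frames, n_coeffs) MFCC arrays, one per structural segment. The returned `merge_costs` list holds the minimal dissimilarity found at each iteration, which is the kind of sequence a stopping rule can be estimated from.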
The clustering is stopped at iteration i_S, which is estimated by fitting a bi-Gaussian model to the set of minimal symmetrized Gaussian likelihood measures obtained at each iteration. It is assumed that these measures belong to one of the two following classes:

  • the measures resulting from the comparison of two segment classes associated with the same structural label, and
  • the measures resulting from the comparison of two segment classes associated with different structural labels.

The classes are determined using a K-means algorithm (with K = 2), and i_S is estimated as the number of elements in the first class. This estimate is then adjusted by two parameters a and b. We assume that the optimal i_S (denoted i*_S) and the estimated i_S are linked by the following linear expression:





Publication date: 2010